PRECC

Section: LOCAL (1L)
Updated: 1 May 1993
Index Return to Main Contents

NAME

precc - Preferred Compiler Compiler 2.40

SYNOPSIS

precc < file.y > file.c

precc file.y > file.c

precc file.y file.c

DESCRIPTION

Precc is a compiler compiler which converts precc-style context-grammar definition scripts (with a .y extension) into C code scripts (with a .c extension). The output code compiles under ANSI C compilers such as the GNU Software Foundation's gcc(1).

There is an easy-to-use hook for lex(1) tokenisers.

Precc extends the Unix yacc(1) utility by allowing:

[0] Contextual definitions. Each grammar definition may be parameterized with contexts. For example, some languages determine whether a declaration is local (and to what) or global in scope by relative indentation, and this can be encoded in precc using the number of spaces indentation as a parameter, n:

: @ decl(n) = space(n) expression <'\n'> decl(n+1)*

This definition is intended to mean that a declaration indented by n spaces consists of n spaces, an expression, and a newline, optionally followed by one or several subdeclarations indented further. The n spaces definition may be encoded as shown in note [3] below.

[1] Infinite lookahead and backtracking in place of the yacc 1-token lookahead, This means that precc parsers distinguish correctly between sentences of the form `foo bah gum' and `foo bah NAY' on a single pass. If you cannot imagine why one should want to decide between the two, think about `if ... then' and `if ... then ... else ... '.

[2] Arbitrarily complex expressions. This means that compound definitions like

: explain {{this | that} {several | no} times}+

are legal within precc definition scripts.

[3] Precc has the postfix operators `*' (zero or more times), `*n' (exactly n times), `+' (one or more times), and `!' (execute accumulated actions now) built in, along with the `[ ]' (optionally) outfix operator. For example, the following means `exactly n spaces':

: @ space(n) = onespace*n

where `onespace' is defined to match against any single white space token.

The other built-ins are

: `?' (any token)
: `^' (beginning of line)
: `$' (end of line)
: `|' (or, placed between alternate phrases of the grammar)
: `{ }' grouping brackets
: `< >' around literals
: `> <' to mean `not a (particular literal)'
: `( )' around the name of a BOOLEAN valued predicate on tokens, defined as an int 1 or 0 -valued C function elsewhere in the script, and
: `) (' (anti-brackets) round a C expression of BOOLEAN type, meaning a logical test condition.
: `]..[' anti-brackets hide an expression, causing it to be required but ignored.

`]a[ b' means that input must satisfy both a and b, while `a ]b[' means that b is trailing context.

`$!' is a shorthand for matching end-of-line (it causes the input buffer to start being overwritten). It is short for the conjunction '$ !', but more efficient, and you should use this whenever you want to get a new line.

`a b c' (conjunction) is the term denoting an expression consisting of an `a expression' followed by a `b expression' followed by a `c expression'. An example of a precc script follows in the section USAGE.

[4] Modular output. Parts of a script can be precc'ed separately, compiled separately, and then linked together later, which makes maintenance and version control easy.

[5] Speed. Precc is fast, typically taking two to five seconds to compile scripts of several hundred lines. And it builds fast parsers too.

[6] Higher order behaviour. `Macros' may be defined in a script. For example,

: @ optional(parser) = parser
: @ | {}

may be defined (this particular example is an equivalent for the built-in `[parser]' construct). And then the construct

: @ ice_cream(flavour) = tub(flavour) optional(sauce)

may be used.

[7] Separate syntax for synthesised attributes and attached actions (2.40 and above). For example, the following synthesises a total for a simple sum without using side-effecting actions and the run-time value stack:

: @ sum = summand\x <'+'> summand\y @x+y

whereas the following uses yacc(1) -style runtime values in an attached action:

: @ sum = summand <'+'> summand :{$$=$1+$3;}:

[8] Built-in error handling capability (2.40 and above). The following code sets the handler `foo' as the parser to use when the parse beyond the `!{foo}' does not match:

: @ typical = okstuff !{foo} morestuff

Malformed parse input will be matched against the parse `okstuff foo', and well-formed input will be matched against `okstuff morestuff'.

Precc is intended to be both easy and convenient to use, but a compiler compiler cannot be understood in one minute. Have a look at the example *.y files in the precc directory to get more of the feel. A more complex line in a grammar definition script than those above may look like:

: @ expr = var { <'+'> | <'-'> } expr
: @ | <'('> expr <')'>

The `@' is an `attention mark'. Every line which does not begin with an `@' is passed through to the output unchanged, so arbitrary C code can be embedded in a precc script. Intended comments must therefore be surrounded by C comment marks, `/*' and `*/'.

A default do-nothing tokeniser is provided in the precc library and will be automatically linked in unless you specify a different yylex() routine to the C compiler. There is nothing to worry about here. If you do nothing yourself, you will get a working parser out of a precc script immediately, but if you particularly want to put your own tokeniser on the input, then you do that by naming it `yylex()' and making it return TOKENs when called. It should write VALUE attributes into `yylval', just like lex(1). Place its object module or source code file ahead of the `-lcc' argument when you use the C compiler, and it will be linked in instead of the default (NB. yylex() must signal EOF to precc by setting `yytchar=EOF', which yylex() routines generated by lex(1) do not seem to get right).

The way to compile a C source code file `foo.c' generated by precc into an executable `foo' is to use an incantation like:

: gcc -Wall -ansi -o foo foo.c -L <precc dir> -lcc

You can change the TOKEN type by #defining it as a C macro in the *.y script (you may want a wider range of TOKENs than the 256 possibilities afforded by the default 8-bit char, and `#define TOKEN short int' is sometimes useful). But it is important that the appropriate precc library is used at link time. The default libcc.a library will usually assume TOKEN=char, but different versions can easily be produced by recompiling the library with TOKEN set to the desired size data.

The parser generated from a precc script will ordinarily signal valid input by absorbing it silently, and signal invalid input by rejecting it and spouting an error message. This is a standard style for compiler-compilers. To get the parser to do anything else, you must decorate the definition script with ACTIONs (see below for details).

The error action may be redefined by #defining an ON_ERROR(x) macro. An x=0 value should give the code to execute on a partial but successful parse and x=1 should give the code to execute on an unsuccessful parse. x=-1 should give code to execute when precc attempts to backtrack across a `cut' (`!', see below). For example:

: #define ON_ERROR(x) x?printf("ow!\n"):printf("ouch!\n")

The default error actions attempt to restart the parse on the next line of input, using the parser p designated by `MAIN(p)' in the script.

You may likewise #define BEGIN and END for C code to be executed at either end of a parse attempt. This means that BEGIN will be re-executed if the parse resyncs after an error, and your code should take account of that (most likely by installing and using a counter).

OPTIONS

Precc can be run as a stdin to stdout filter, taking no options or arguments. It is better practice, however, to use the command line options:

: precc infile outfile

because then there is no danger of precc misidentifying the console or keyboard when you have redirected stdin and stdout.

The default sizes of various internal buffers can be changed by command line options (version 2.40 and above only), as follows:

: -rNNNN The read buffer size in Kb. This determines the maximum char length of a single production in a script readable by precc. Default 2Kb/ 2K chars.
: -pNNNN The maximum size in Kb of the internal program (tables) built by precc during the scan of a specification script. It correlates to the maximum number of symbols in a single production rule. Default 20Kb/4K instructions.
: -vNNNN The maximum size in Kb of the internal data built by precc during the scan of the specification script. Default 16Kb/4K data items.
: -fNNNN The maximal size in Kb of the area used by precc to store backtrack points when scanning a script. It correlates to the maximal number of sequents in a production rule. Default 16Kb/1K breakpoints.

The sizes need only be changed if precc fails to parse an input script returning an error message that indicates an overflow of one of these buffers.

The buffers are also used by utilities built by precc, and their sizes in the utilities are set by the macros READBUFFERSIZE, MAXPROGRAMSIZE, STACKSIZE and CONTEXTSTACKSIZE respectively (see cc.h and ccx.h).

ENVIRONMENT

The following macros must be set in the user's grammar definition script:

# define TOKEN tokentype

(default char) This defines the space reserved for each incoming token in the parser which precc builds. Note that a corresponding version of libcc.a must be linked in later at compile time.

# define VALUE valuetype

(default char*) This defines the space reserved for each value on the runtime stack manipulated by the runtime program which precc attaches to the parser. There is no good reason for changing this to a type which is shorter than long int (or far *char), because the actual space used will be a union type which is at least as long as these.

The following macros can be set if required:

: # define READBUFFERSIZE length

(default 2048) This defines the lookahead token buffer length. No more than <length> tokens can appear between cut marks (`!') in the script, as without cut indicators, precc cannot know if the parser might later backtrack or not, and will not embed buffer reset instructions.

: # define MAXPROGRAMSIZE length

(default 4096) This defines the maximum length of the internal program built by parsers in order to execute attached actions. In call_mode=0 (auto) mode, the stack also contains its own shift instructions, but in call_mode=1 (manual) mode it serves more as a temporary memory area, and usage can be controlled precisely.

: # define STACKSIZE length

(default 4096) This defines the size of the runtime stack used to manipulate yacc-style attached attributes ($1, $2, etc.). Usage is approximately proportional to nesting depth in productions.

: # define CONTEXTSTACKSIZE length

(default 1024) This defines the number of breakpoints that can be held for backtracking. Usage is proportional to the number of sequents in productions between cuts.

: # define C_STACKSIZE length

(default 0x7FFF or 32K) This is the C call stack.

Now for the horrors of synthetic attributes. To get a parser generated by precc to do anything significant, you need either to get it to synthesise a data structure, or get it to generate outputs. Whichever, you usually need to scatter actions through the grammar definition script.

Actions are pieces of C code (terminated by a semi-colon) and placed between a pair of colons (`: ... :') in the grammar definition script. It is good style to make the interior a C block, so the terminators become `:{ ... }:'. For example:

: @ addexpr = expr <'+'> expr :{ $$=$1+$3; }:

is not unreasonable. The `$$' can be replaced by `VV(3)', telling precc explicitly that the addexpr manipulates three values, notionally attached to each of the terms on the right of the equals sign, but this will be effected automatically if the default `call_mode=0' is selected. `$1' is the value attached to the leftmost term, and `$3' is that attached to the rightmost term. The `$1' may be replaced by the explicit `V(1)' within C macros (their contents are not directly accessible to precc). `Values' attached to each term of a precc expression are an easy way to think of what is going on, so long as none of the values has a `side-effect' like printing a value, or updating an array (see below).

If the non-default `call_mode=1' is set, then a VV(n) should appear on the left hand side of the action code, and an easy way to do this is to put it on the left of the first equals sign (assignment in C). This mode is only used by experts in order to optimize the generated code, and it should never be used by beginners or when prototyping. After the VV(n) call in call_mode=1 mode, the $n values will be correctly aligned with the grammar expressions, and without it, they will not be (Note: in versions of precc 1.50 or later, the calling convention is set to `call_mode=0' by default, which circumvents the need for VV(n) calls at the expense of some runtime speed).

The value to be associated with the whole expression should be written into $$ (or `VV(n)'). This is all exactly the same as the treatment in the Unix yacc(1) utility, and it also allows you to incorporate the same incomprehensible trick of pulling values off the stack when they are notionally `further to the left' than your current expression, using $0 or even lower references. Take care!

Side-effecting values need a little explanation. Because precc is an infinite look-ahead parser, it cannot execute actions at the same time as it reads input. It might have to later backtrack across its parse, and, whilst it might deconstruct data structures built up in the parse, it is certainly impossible to undo writes to stdout which might have occurred.

So precc builds a program as it parses. When the parse finishes correctly, the program is executed by an internal engine, but if the parse is unsuccessful or has to be backtracked, the program is `unbuilt' before its actions are executed. This program is a linear sequence of C code actions which have been specified in the precc definition file. Thus the specification:

: @abc=a b c :{printf("D");}:
: @a=<'a'> :{printf("A");}:
: @b=<'b'> :{printf("B");}:
: @c=<'c'> :{printf("C");}:

will, upon receiving input "abc", generate the program

: printf("A");printf("B");printf("C");printf("D");

to be executed later. Thus actions attached to a sequence expression may be thought of as occurring immediately after the actions attached to sub-expressions, and so on down. That explanation should enable you to generate side-effects in the correct sequence.

As remarked above, in version 1.50 of precc and later, using the default call_mode=0 mode, it is not necessary to set VV(n) counts in actions. In earlier versions, it is, if you want to stay sane. Precc 1.50 and later will do all runtime stack manipulations for you if you set

: call_mode=0;

in the BEGIN macro. This actuates the `caller is responsible' (same as C) convention for stack shuffling, and precc is the caller. The alternative convention, `callee is responsible' (same as Pascal), and you are the callee, is set by

: call_mode=1;

and this was the only convention available under versions earlier than 1.50. The C convention should be backwards compatible with the Pascal one provided that no deliberate `tricks' appear in the script. It is extremely unlikely that a novice will have known or used any.

USAGE

Precc grammar description files conventionally have the .y suffix, and should follow the following format:

: # define TOKEN ... (default = char)
: # define VALUE ... (default = char*)
: # define BEGIN ... (default nothing)
: # define END ... (default nothing)
: # define ON_ERROR(x) ... (defaults to standard)
: # include "cc.h" (or ccx.h)
: (sometimes # include "precc.h" may be useful too)
: @ first definition : attached action; :
: @ ...
: @ ...
: MAIN(name of entry parser)

The cc.h header file may be used instead of ccx.h in scripts which consist only of unparameterized definitions and terms.

EXAMPLE

The following script defines a simple +/- calculator:

: # define TOKEN char
: # define VALUE int
: # define BEGIN call_mode=0;
: # include "cc.h"
: # include <ctype.h>
: static int acc;
: @ digit = (isdigit) :{$$=$1-'0';
: @ acc=acc*10+$1;}:
: @ posint= digit posint :{$$=$2;}:
: @ | digit :{$$=$1;acc=0;}:
: @ int = <'-'> posint :{$$=-$2;}:
: @ | posint
: @ atom = <'('>expr<')'> :{$$=$2;}:
: @ | int
: @ expr = atom sign_sum :{$$=$1+$2;}:
: @ | atom
: @ sign_sum= <'-'> atom sign_sum
: @ :{$$=-$1+$3;}:
: @ | <'-'> atom :{$$=-$2;}:
: @ | <'+'> atom sign_sum
: @ :{$$=$1+$3;}:
: @ | <'+'> atom :{$$=$2;}:
: MAIN(expr)

This script must be passed through precc:

: precc < calculator.y > calculator.c

and then compiled, using the precc kernel library in libcc.a:

: gcc -Wall -ansi -o calculator calculator.c -L ... -lcc

The three dots stand for the directory in which the precc library file libcc.a has been placed.

Note that `$$=$1' (`VV(1)=V(1)') has no overall effect, so it has been dropped from most of the points in the script where it might have been expected.

Also note that it would have been more efficient to use an optional following action and write

: @expr = atom [ sign_sum :{$$=..}: ]

instead of

: @expr = atom sign_sum :{$$=..}: | atom

because the latter expression will build a parser which needlessly checks twice for atom when no sign_sum follows.

For an example of a parser which uses parameters, the following definition of a parser which accepts only the fibonaci sequence as input may be useful:

: # define TOKEN char
: # define VALUE char*
: # define BEGIN call_mode=0;
: # include "ccx.h"
: # include <math.h>
: # define INT(x) (int)(x)
: # define DIV(m,n) INT(INT(m)/INT(n))
: # define MOD(m,n) INT(INT(m)%INT(n))
: # define LOG10(n) INT(log10((double)(n)))
: # define DBLE(n) (double(n)
: # define TEN DBLE(10)
: # define FIRSTDIGIT(n) \ (0!=n)?DIV((n),pow(TEN,DBLE(LOG10(n)))):0
: # define LASTDIGITS(n) \ (0!=n)?MOD((n),pow(TEN,DBLE(LOG10(n)))):0
: # define INIT VV(0)=0
: # define STEP VV(2)=V(0)+1
: # define TOTL VV(2)="";printf("%d terms OKext terms are\ %d,%d,..,V(0),a,b)
: MAIN(fibs)
: @fibs = { :{INIT;}: fib(1,1) ! }*
: @fib(a,b) = number(a) <','> :{STEP;}: fib(b,a+b)
: @ | <'.'> <'.'> :{TOTL;}:
: @number(n)= digit(n)
: @ | digit(FIRSTDIGIT(n)) number(LASTDIGITS(n))
: @digit(n) = <n+'0'> :{/* rep. of 1 digit n */}:

The following are some example inputs and responses:

: 1,1,2,3,5,..
: 5 terms OK
: Next terms are 8,13,..
: 1,1,2,3,5,8,13,21,34,51,85,..
: error: failed parse: probable error at <>1,85,..

FILES

The following files are to be found in the /users/news/precc directory:

precc: Precc executable
precc.y: Precc definition in its own language
precc.c: Precc C source code (tweaked from that generated by precc from precc.y).
precc.h: Precc header file, needed only to construct precc.
ccx.c: The source code of the precc 1.0 kernel operations, needed to make ccx.o, included in libcc.a.
cc.c: The source code of the unparameterized precc 1.0 kernel operations, needed to make cc.o, included in libcc.a.
ccx.h: The header file of the precc 1.0 kernel operations, needed by every code constructed by precc.
cc.h: The header file of the unparameterized precc 1.0 kernel operations, an alternative to cc.h if you do not use parameterized definitions.
yystuff.c: Default lexer which allows you to escape newlines.
on_error.c: Default error routines.
libcc.a: The library containing cc.o, ccx.o and yystuff.o, needed to compile an executable from code built by precc.
Makefile: The makefile for precc.
test.y: Simple test script for precc.
test.c: C output from the test.y script.
test: The test parser built by `gcc -ansi -o test test.c -L ... -lcc'.

AUTHOR

Peter Breuer, Programming Research Group, Oxford University Computing Laboratory, UK.
Man page also hacked by Jonathan Bowen.

BUGS

1. On Sun3's, the gcc compiler still complains that printf is being redefined. I don't know why. If anyone finds the right compiler switch to magic this away, please tell me! For the hp300 series, the switch is -D__hp9000s300, if that's any clue?

2. (Cured Mar 10 1992 in v1.1)

3. If you drastically change the type of VALUE in your script (make it larger than char*), you will also have to change the type declared in cc.h and then recompile the libcc.a library. This is not a bug but a feature.

4. (Cured Mar 17 1992 in v1.2).

5. It has been reported that the IBM `ANSI' C compiler does not like the

: typedef STATUS PARSER();

definition made by precc. That is their problem.

7. (patch issued for precc 2.30+ April 15 1993). Error in p_uniq0 code prevented recognition of all backtrack errors, with the effect that they were caught as failed parses some time later instead.

8. (patch will be issued for 2.40+). Precc's C expresions don't permit the use of `.' as an operator. My omission. Use a macro instead until corrected.

Please report problems to <Peter.Breuer@comlab.ox.ac.uk>.

NOTES

The precc 1.0 utility is line buffered, and so is every program it constructs. That is because of the `infinite lookahead' capability: it would wait till the end of an input file before delivering any output, given the chance! So I caused each input line to be treated separately, with the side effect of limiting all constructs to one line in length. The top-level parser declared in MAIN(..) is repeated at every line, and errors (by default) cause a restart attempt on the next line. If you don't like long lines, a terminating escape (\) will continue one line on to the next (when you use the default yylex() built into the library).

In version 1.30 and above, lines could be continued by placing an `@' at the beginning of the next line, without a `\' at the end of the previous line. Each sequence of `@' continued lines must be terminated by an empty line.

Version 1.40 introduced a hook TOKEN *yybuffer for external lexical analysers. This is where lexers must eventually write their output for precc to see it. Version 2.0 and above use a special routine mygets() to call yylex() which places the TOKEN returned by yylex() in the right place automatically. For backwards compatibility, it is still possible to write into yybuffer directly, however. Note that, as mentioned already, EOF is tested by looking at the global int yytchar. The default yylex() lexer in libcc.a handles all this correctly (It just permits newlines to be escaped when preceded by '' in the input stream).

The default zer_error() handler supplied with precc simply prints an error message and the unparsed portion of the string. That might well be all of the string, since precc parsers try their darndest to make a match, then backtrack, so the (TOKEN *)maxp pointer is provided. This points to the deepest successful penetration into the incoming string, and is usually the point to look for the error. The pointer (TOKEN *)pstr shows the unparsed string, of which (TOKEN *)maxp will be an end-segment.

If you want to try and resync the parse at an error, a sensible thing to do would be to (rewrite zer_error to) skip a token at maxp, and rerun the parse. You will have to read the code of the run() function defined in cc.c to make sense of it, but you might try:

: strcpy(maxp,maxp+1);tok=the_top_level_parser();\ if(GOODSTATUS(tok)){\ pc=0;pc=evaluate());\ }\ else printf("At least I tried!))

Using a counter to set a maximal number of resync attempts in a single line would also be sensible!.

You can obviate any bad_error() call by making sure that the top-level parser has a failsafe fallthrough to a ?* parser, with some kind of error action attached.

The version 2.x series precc extended version 1.x by allowing parameters to each clause of the grammar (i.e., it treats inherited attribute grammars as well as synthetic ones), and by introducing the `!' (cut) marker. This can be inserted in expressions in order to stop backtracking through that point, which is useful in avoiding excessively long searches for alternate parses when no alternate is possible.

Promises: version 2.x will eventually eliminate the archaic yacc-style of stack manipulation with something much nicer. Version 3.0 should implement tight type-checking. Contact the author for the most recent version.

This document was created by man2html, using the manual pages.
Time: 00:30:15 GMT, March 30, 2022